Optimizing active decision making using simulated decision making

ABSTRACT

A method and a computer implemented system for improving an active decision making process by using a simulation model of the decision making process. The simulation model is used to evaluate the impact of alternative decisions at a choice point, in order to select one alternative. The method or system may be integrated with an external system, like a manufacturing execution system. The simulation model may be stochastic, may be updated from monitoring the external system or the simulations, or may contain a Bayesian network.

TECHNICAL FIELD

This invention relates in general to the field of decision making, and more particularly, to the integration of simulated and active decision making.

BACKGROUND ART

Decision making requires choosing among several alternatives. For a decision-making agent, this might involve selecting a specific action from several alternative actions that are possible at any given point of time. Active decision making involves repeating this selection of an appropriate action in real-time at subsequent points of time.

According to decision theory, we should always make the decision that maximizes our future utility, where utility is some measure such as profit or loss, pain or pleasure, or time. To make a real-time decision, we need to elaborate possible future decision sequences and choose the immediate decision that results in the highest utility. For example, in chess, we might be considering five possible moves and for each of those five moves, we might have to consider five responses from our opponent, and for each of those five responses we might have to consider five responses from us . . . and so on, until we reach the end of the game, where the outcome is either a win, loss, or draw. If one of those moves is guaranteed to lead to a winning outcome, then that move is the move of choice.

FIG. 1 shows a portion of the lookahead tree built with this strategy. Nodes in this tree represent states or situations of the chessboard; directed arcs represent moves that result in a new state. The arcs emanating from the root node represent the first player's move possibilities; the ones from the level below that, the second player's move possibilities; alternate layers represent the alternating moves between the two players. If moving the queen is guaranteed to lead to a winning outcome and the rest of the moves are not, then that is the move of choice.

In general, it is not feasible to compute the actual outcome for a given position (except for those near the end) in real time because the full lookahead tree is too large to search. As a result, most chess programs look ahead to a limited horizon, and at this horizon they return a heuristic estimate of the final outcome. A positive heuristic estimate might signify a desirable state; a negative estimate, an undesirable state; and a zero estimate, a neutral state. In particular, IBM's Deep Blue program used a weighted combination of material, position, king's safety and tempo for its heuristic function. For example, the material portion might score each pawn 1, each bishop 3, each knight 4, each rook 5, and the queen 9. Using such a lookahead, Deep Blue managed to defeat Gary Kasparov, the world champion human chess player.

Branch-and-bound pruning is a typical approach to further restrict the lookahead tree. It allows pruning a path to a state if it can be proved that the outcome at that state will not affect the value of an ancestor. For example, FIG. 2 shows a lookahead tree where the object is to compute the minimum value over the tree. If the heuristic is guaranteed to be a lower bound of the final backed-up value, then we can use that property to prune an entire subtree, thus increasing the efficiency of lookahead. The figure shows that, after visiting the left subtree, the root's value will be less than or equal to three; if the heuristic lower bound at the right subtree is five, then that entire subtree can be pruned according to the branch-and-bound principle.
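
By way of illustration only (this sketch is ours, not part of the original disclosure), the following Python fragment applies branch-and-bound to a small minimizing lookahead tree, assuming, as the text does, that the heuristic is a guaranteed lower bound on the backed-up value; the tree and its numbers are hypothetical.

```python
# A minimal sketch of branch-and-bound on a minimizing lookahead tree.
# Assumes heuristic is a guaranteed lower bound on each node's final
# backed-up value, as described above; the tree itself is hypothetical.

class Node:
    def __init__(self, value=None, children=(), heuristic=0):
        self.value = value          # terminal value (None for interior nodes)
        self.children = children
        self.heuristic = heuristic  # lower bound on the backed-up value

def bb_min(node, bound=float("inf")):
    """Return the minimum backed-up value of `node`, pruning any child
    whose lower-bound heuristic already meets or exceeds the bound."""
    if node.value is not None:      # terminal node
        return node.value
    best = float("inf")
    for child in node.children:
        if child.heuristic >= min(bound, best):
            continue                # prune: cannot improve an ancestor's value
        best = min(best, bb_min(child, min(bound, best)))
    return best

# Mirrors FIG. 2: the left subtree backs up 3, so a right subtree whose
# heuristic lower bound is 5 is pruned without being expanded.
root = Node(children=(
    Node(children=(Node(value=3), Node(value=7)), heuristic=2),
    Node(children=(Node(value=9), Node(value=6)), heuristic=5),
))
print(bb_min(root))  # -> 3
```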

FIG. 3 shows why branch-and-bound fails when uncertainty is present. In this figure, bundled arcs represent uncertainty. For example, the root's right child is a single decision that has two possible outcomes, one with probability 0.2 and the other with probability 0.8. This might represent a machine producing a faulty part with probability 0.2 and a non-faulty one with probability 0.8. The final backed-up value at this probabilistic child is the weighted sum of the two sibling outcomes, where the weighting is according to the probabilities. Given that the current pruning bound at the root is 3 (derived from the left child), the pruning bound at the child with the 0.2 arc becomes 3/0.2=15, which is an increase over the pruning bound of the parent. Since this increase will happen at any node with uncertainty below it, the pruning bound grows exponentially large and therefore becomes useless for pruning. This means that little or no pruning is possible when uncertainty is involved. As a result, it is not feasible to produce deep lookahead trees in problems involving uncertainty unless alternate lookahead search methods are developed. Worse still, standard lookahead fails when the probability distribution is continuous (e.g. the processing time for a machine might be normally distributed) because the number of children is infinite.

The traditional approach to model uncertainty relies on Markov Decision Processes (MDPs). An MDP consists of a tuple <S,A,p,G,u>:

-   S is a set of states that represent situations in the particular world. For example, it might represent the set of possible buffer values and the state of each machine (busy, idling, producing a certain part). Just as in chess, the state space is explicitly built through the application of actions and only that portion of the state space necessary to make a decision is enumerated. States in manufacturing encode the time and in-process actions.
-   A is the set of actions (decisions). An action in our manufacturing environment is to commit a particular resource to a particular task. An action can also be a vector of parallel actions, one for each resource.
-   p(t|a,s) is the probability of state t given action a in state s. This is the transition function that reflects the outcome of applying an action to a state. For example, it can capture that a part might be produced with a certain defect with a certain probability.
-   G(s) is a goal predicate that is true when a terminal state is reached. In a make-to-order manufacturing environment, G(s) is true when all orders have been fulfilled. In a make-to-stock environment, this might capture stochastic (forecasted) demand over a given time interval.
-   u(s) is the utility of that goal state. We plan on using profit/loss as the utility. We assume that the state encodes any profit or loss incurred along the path to the goal state; this simplifies the presentation of the objective function below.

Using this framework, it is possible to define an objective function:

$$f(s) = \begin{cases} u(s) & \text{if } G(s) \\ \max\limits_{a \in A} \sum\limits_{t \in S} p(t \mid a, s)\, f(t) & \text{otherwise} \end{cases}$$

Bellman introduced a form of this function in 1957, and others have since elaborated it in the fields of Operations Research and Artificial Intelligence. According to decision theory, we want to choose the action a that maximizes $\sum_{t \in S} p(t \mid a, s)\, f(t)$ for state s. The function f(t) is computed by lookahead.
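
As a concrete illustration (ours, with an invented toy MDP rather than anything from the patent), the following sketch computes f(s) by recursive lookahead and then picks the maximizing action:

```python
# A minimal sketch of the Bellman-style objective above. The tiny MDP
# (transition probabilities and goal utilities) is invented for clarity.

# transitions[s][a] = list of (next_state, probability) pairs, i.e. p(t|a,s)
transitions = {
    "s0": {"a1": [("s1", 0.8), ("s2", 0.2)],
           "a2": [("s2", 1.0)]},
    "s1": {"a1": [("win", 1.0)]},
    "s2": {"a1": [("win", 0.3), ("lose", 0.7)]},
}
utility = {"win": 1.0, "lose": -1.0}   # u(s) at states where G(s) holds

def f(s):
    """u(s) at a goal state, else the max over actions of the
    probability-weighted value of the successors."""
    if s in utility:
        return utility[s]
    return max(sum(p * f(t) for t, p in succ)
               for succ in transitions[s].values())

def best_action(s):
    """Choose the action a that maximizes sum_t p(t|a,s) f(t)."""
    return max(transitions[s],
               key=lambda a: sum(p * f(t) for t, p in transitions[s][a]))

print(best_action("s0"))   # -> "a1" (0.8*1.0 + 0.2*(-0.4) = 0.72 beats -0.4)
```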

In contrast to an artificial application like chess, the complexity of real-world applications makes lookahead more challenging. Since generating and applying actions in real-world applications typically take much more time than in a chess game, deep lookahead with branch-and-bound pruning is not practical, even without uncertainty. In other words, the problem with real-world MDPs is that they cannot be efficiently computed for deep lookahead.

Coming from a different background, U.S. Pat. No. 5,764,953 issued on Jun. 9, 1998 to Collins et al. describes an integration of active and simulated decision making processes. However, the purpose of that integration was to reduce the cost and errors of simulation; that is, the active decision making process was used to improve the simulated decision making process. Our integration of the two processes is for the opposite purpose: the simulated decision making process is used to improve the active decision making process. Because of this reversal of purpose, our integration is also technically very different from their integration.

Real-time decision-making in real-world manufacturing applications is currently dominated by dispatch rules. A dispatch rule is a fixed rule used to rapidly make processing or transfer decisions. Examples include the following (two of these rules are sketched in code after the list):

Kanban: produce a part only if required by a downstream machine.

CONWIP: maintain a constant amount of work in process.

First-come, first-served.

Choose the shortest route to get to a destination.
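
For concreteness (our sketch, with a hypothetical queue representation), two of these rules might look as follows; note that each is a fixed function of the local state, with no lookahead:

```python
# Two dispatch rules expressed as fixed functions of local state.
# The state representation (queue contents per machine) is hypothetical.

def first_come_first_served(queue):
    """FCFS: always process the part that arrived earliest."""
    return queue[0] if queue else None

def shortest_queue(queues):
    """Route the part to the machine with the shortest queue."""
    return min(queues, key=lambda m: len(queues[m]))

queues = {"A": ["p1", "p2"], "B": ["p3"]}
print(shortest_queue(queues))  # -> "B": myopic, ignores future impact
```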

Dispatch rules have several problems. First, they are myopic: they don't take into account the future impact of their decisions. Any fixed, finite rule can capture only a finite portion of the manufacturing complexity. As a result, dispatch rules are notorious for making non-optimal decisions. Second, they do not take advantage of additional decision-making time that might be available to improve decision-making quality (say, through lookahead). The traditional “control-oriented” view is that a fixed dispatch rule is determined ahead of time, programmed into a controller, and executed without further deliberation. Third, most dispatch rules do not take into account the particular target goal state; they are applied blindly.

DISCLOSURE OF INVENTION

In view of the shortcomings of existing lookahead techniques for active decision making, this invention is directed to a new lookahead process for active decision making that integrates a simulated decision making process.

A preferred embodiment of our invention uses a new type of objective function, one that computes the expected outcome:

$$f(s) = \begin{cases} u(s) & \text{if } G(s) \\ \sum\limits_{a \in A} p(a \mid s) \sum\limits_{t \in S} p(t \mid a, s)\, f(t) & \text{otherwise} \end{cases}$$

In this embodiment, the decisions are made according to the same rule as before: choose the action a that maximizes $\sum_{t \in S} p(t \mid a, s)\, f(t)$ for state s, where f(t) is computed with the above expectation-based function rather than the Bellman-style function. Instead of the actual expected outcome, an estimate of it could be used by sampling some of the actions a and successor states t.

The function p(a|s) is called a stochastic policy, which is the probability of action a given state s. This policy guides the decision-maker by appropriately weighting the outcome of each branch during the computation of the above objective function. For example, FIG. 4 shows that the expected value is 4.75 at the root. For multiple decision-making agents, this function defines a stochastic coordination policy, which describes how all agents are likely to behave.

We have developed a sampling apparatus to compute this function efficiently. This apparatus is based on the Monte Carlo principle of simulation: produce samples according to the underlying probability distribution. In our case, we repeatedly sample paths to terminals, where each choice point is resolved by sampling randomly from the distributions defined by p(a|s) and p(t|a,s), and return the average value over multiple samples. Clearly, this sampling apparatus will converge to the above expectation-based function as the number of samples gets large.
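
A minimal sketch of this sampling apparatus follows; the policy and transition tables are invented stand-ins for p(a|s) and p(t|a,s), which in practice would come from the stochastic policy and the simulation model:

```python
import random

# Monte Carlo sketch of the sampling apparatus: repeatedly sample a path
# to a terminal, drawing each choice from p(a|s) and each outcome from
# p(t|a,s), and return the average terminal utility. Tables are invented.

policy = {"s0": [("a1", 0.7), ("a2", 0.3)],                 # p(a|s)
          "s1": [("a1", 1.0)]}
transition = {("s0", "a1"): [("s1", 0.5), ("good", 0.5)],   # p(t|a,s)
              ("s0", "a2"): [("bad", 1.0)],
              ("s1", "a1"): [("good", 0.4), ("bad", 0.6)]}
utility = {"good": 1.0, "bad": 0.0}                         # terminal utilities

def sample(dist):
    """Draw one outcome from a list of (outcome, probability) pairs."""
    return random.choices([o for o, _ in dist], [p for _, p in dist])[0]

def estimate(s, n_samples=100_000):
    """Average utility over sampled paths; converges to the
    expectation-based objective as the number of samples grows."""
    total = 0.0
    for _ in range(n_samples):
        state = s
        while state not in utility:                 # until a terminal
            a = sample(policy[state])               # choice from p(a|s)
            state = sample(transition[(state, a)])  # outcome from p(t|a,s)
        total += utility[state]
    return total / n_samples

print(estimate("s0"))   # -> approximately 0.49 = 0.7*(0.5*0.4 + 0.5*1.0)
```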

This sampling approach has several advantages. First and most important, it focuses the search only on those portions of the lookahead tree that are likely to occur. This makes it computationally efficient. Second, it can be used to make real-time decisions where deliberation time can be traded against accuracy: the more samples, the more accurate the result in terms of its difference from the expectation-based function. Finally, it can be sped up by parallelism: multiple machines can compute different samples in parallel.

The major advantage of the expectation-based approach is that an agent can take into account how the other agents are likely to behave rather than how they optimally behave. For example, in computer chess, the usual assumption is that the opponent will play optimally against us. This assumption makes chess programs play conservatively because they assume a perfect opponent. It might be possible to improve performance if we played to the opponent's likely moves rather than optimal moves.

In general, the stochastic policy becomes an integral part of the simulated decision making model of the application. Each run of this simulation model generates a new branch of the lookahead tree.

Our approach has several advantages over dispatch rules. First, it is situation specific. It is computationally simpler to make a decision for a specific state and target production goal through lookahead than it is to learn a general dispatch rule for all states and all goals. This is because lookahead only elaborates that portion of the lookahead tree necessary to make an immediate decision. Producing a full schedule of future events is much more expensive. Second, it is not necessary to build a separate simulation model or to halt production in order to test the lookahead-based approach. The lookahead model itself functions as a simulator, a smart one that includes future decisions. Third, it is possible to learn the parameters of the model from factory-floor data, thus reducing the cost of deploying our system. Finally, our approach scales with parallelism: we can distribute the decision making to multiple agents, each representing a resource.

Additional features and advantages of the invention will be set forth in part in the description that follows, and in part will be apparent from the description, or may be learned by practice of the invention. The advantages of the invention will be realized and attained by the system particularly pointed out in the written description and claims hereof, as well as in the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only, and not restrictive of the invention, as claimed.

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a portion of the lookahead tree built for chess. Nodes in this tree represent states or situations of the chessboard; directed arcs represent moves that result in a new state. The arcs emanating from the root node represent the first player's move possibilities; the ones from the level below that, the second player's move possibilities; alternate layers represent the alternating moves between the two players. If moving the queen is guaranteed to lead to a winning outcome and the rest of the moves are not, then that is the move of choice.

FIG. 2 shows how branch-and-bound pruning works when no uncertainty is present. A node can be pruned as long as it can be proved not to affect the value of an ancestor. In this case, the lower-bound property of the heuristic is used to prune the node.

FIG. 3 shows that branch-and-bound pruning fails when uncertainty is present. The reason is that the pruning bound grows exponentially with each level of uncertainty.

FIG. 4 shows how the expected value is computed at the root.

FIG. 5 shows our system architecture for decision-making on the factory floor. The factory model, which consists of a model of the effects of each action and their likelihood, is used by a set of decision-making agents to make a decision. This decision takes effect on the factory floor and the results of this decision are analyzed by the learning system, which in turn modifies the parameters of the factory model. The figure also shows interfaces to the factory floor environment, which consists of database information about inventory, resource availability, calendars, suppliers, shift information and customer orders. Such functions are typically handled by vendors of MRP, ERP, Supply-Chain Management, and Order Management systems.

FIG. 6 shows a flexible manufacturing system.

FIG. 7 shows how each agent makes a decision for a particular state. Each agent generates a list of possible actions for itself. It then samples the other possible actions of other agents according to the stochastic policy p(a|s). For each of these samples, it computes the lookahead. Next, each agent chooses the action that is associated with the lowest average outcome (where the average is derived from the aforementioned sampling process). Finally, each agent applies the action. The resulting state becomes the new state and the cycle continues.

FIG. 8 describes the lookahead method. If the state is a terminal, then the terminal's value is returned. This value represents the utility for that state. For example, the utility might include the path cost plus a heuristic estimate. Or, it might be the path cost plus the final outcome value. If a terminal is not reached, then the set of actions is sampled according to p(a|s) across all decision-makers, and the resulting action set is applied by computing the next state. From this step, the loop continues to the first step (the terminal check).

FIG. 9 shows an example of a knowledge representation structure for the stochastic policy. The particular structure is that of a Bayesian Network. In principle, other structures such as neural nets or decision trees could be used to represent that policy.

FIG. 10 shows that the lookahead model can be used as a simulator that interleaves decision-making with simulation.

FIG. 11 shows the interfaces of Oasys, an application of this invention.

FIG. 12 shows the interleaving of execution and lookahead modes in Oasys.

BEST MODE FOR CARRYING OUT THE INVENTION

We will now detail an exemplary embodiment, called Simulation-Based Real-Time Decision-Making (SRDM), of the invention. One skilled in the art, given the description herein, will recognize the utility of the system of the present invention in a variety of contexts in which decision making problems exist. For example, it is conceivable that the system of the present invention may be adapted to decision making domains existing in organizations engaged in activities such as telecommunications, power generation, traffic management, medical resource management, transportation dispatching, emergency services dispatching, inventory management, logistics, and others. However, for ease of description, as well as for purposes of illustration, the present invention primarily will be described in the context of a factory environment with manufacturing activities.

FIG. 5 shows our system architecture for real-time factory-floor decision-making. The Factory Model consists of information such as the structure of the factory, how each action affects the state of the factory, how often resources fail, and how often parts are defective. This information is used by the Decision-Making Agents for lookahead and decision-making. These agents make a decision independently and in parallel that takes effect on the Factory Floor. The Factory Floor responds with updates to the state, which are sensed through the sensors. This information is then fed through the Learning System, which, in turn, updates the Factory Model. This cycle continues indefinitely.

For a specific illustration, consider the routing problem in a simple, reliable, flexible (one part type with multiple routings) manufacturing system as shown in FIG. 6. This system consists of 5 machines (A to E) arranged in two layers and connected by various route segments. Identical parts arriving from the left side are completely processed when they depart at the right side, after following any of the following alternative routes: A-C, A-D, B-D, or B-E. Thus, there are three decision opportunities:

1. New part: choose either Machine A or B.

2. After Machine A: choose either Machine C or D.

3. After Machine B: choose either Machine D or E.

Thus, the decision alternatives are Left (A, C, and D, respectively, for the three decisions) or Right (B, D, and E, respectively). The route segments as well as the queues in front of each machine are FIFO (first-in-first-out). The operational objective (KPI) is to minimize the average lead time, that is, the average time a part spends in the system (from arrival to departure). In this illustration, the arrival and processing times are exponentially distributed; FIG. 6 also shows the corresponding means (in minutes). The travel time between each pair of nodes is fixed at 2 minutes.

In SRDM, the top-level decision maker uses the steps given in FIG. 7 to make and apply decisions. In a state (or situation) s, it first generates a list A of possible actions (or decisions or alternatives). For each action in A, it samples the actions of other decision makers based on the stochastic policy p(a|s) (our invention covers the special case of a uniform stochastic policy, where all p(a|s) have identical values for any given state). For each such sample, it computes the lookahead outcome using the steps given in FIG. 8. It then chooses and applies the action with the lowest associated outcome.

To compute the lookahead outcome (see steps in FIG. 8), the decision maker keeps applying the simulated actions of all decision makers (including itself) and sampling new actions until the lookahead depth (terminal) is reached. It repeats this several times and returns, as the outcome, the average of the utility of all the terminals. The utility is the sum of the utility observed until reaching the terminal and the heuristic value of the terminal.

SRDM relies on a discrete event simulation model of the underlying application. Though the simulation uses a fixed policy (deterministic or stochastic, manual or optimized), SRDM does not use that policy to make a decision in the current situation. Instead, it runs several simulations (called look-aheads) for a small number of alternative decisions and then selects the decision that optimizes the Key Performance Indicators (KPIs). In short, the look-ahead simulations overcome the myopia and rigidity of the underlying fixed policy by taking into account the longer-term impact of each decision in the current situation. Each look-ahead simulation is used to compute the KPIs by combining the KPIs observed during the look-ahead and the KPIs estimated from the terminal situation in that look-ahead.

SRDM is defined by four key parameters:

1. Policy: Which fixed policy to use during the look-ahead simulations?
2. Depth: How long to run each look-ahead simulation?
3. Width: How many look-ahead simulations to run for each decision alternative?
4. Heuristics: Which heuristics to use to estimate the KPIs at the end of each look-ahead simulation? Heuristics are necessary to estimate the KPIs for the work in progress.

For each decision opportunity, SRDM uses the simulation model to generate the required number of depth-restricted look-ahead simulations for each alternative. The KPIs from these look-aheads are averaged and the decision with the best aggregated KPI is chosen.

The real-time constraint is met as follows: SRDM starts with depth 0, where the fixed policy completely determines the decision. SRDM keeps incrementing the depth until the available time runs out or the depth limit is reached. Finally, it chooses the decision based on the last depth for which all the look-aheads were successfully completed. A more sophisticated version of SRDM interleaves both the depth and width increments to provide decisions with a desired statistical confidence level.
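
To make the interplay of the four parameters and the real-time loop concrete, here is a sketch in Python; `simulate_lookahead` (one depth-limited look-ahead simulation returning a KPI, lower being better) and the deadline handling are our assumptions, not the patent's code:

```python
import statistics
import time

# Sketch of SRDM's anytime decision loop: depth 0 defers to the fixed
# policy; while time remains, deepen the look-ahead and keep the best
# decision from the last fully completed depth. The hook
# simulate_lookahead(state, decision, policy, depth) is a hypothetical
# interface to the discrete event simulation; it returns one sampled KPI.

def srdm_decide(state, alternatives, policy, simulate_lookahead,
                max_depth=20, width=5, deadline=None):
    best = policy(state, alternatives)    # depth 0: the fixed policy decides
    for depth in range(1, max_depth + 1):
        if deadline is not None and time.monotonic() >= deadline:
            break                         # time ran out: keep the last result
        scores = {}
        for d in alternatives:
            # Width: several look-ahead simulations per alternative,
            # each combining observed and heuristically estimated KPIs.
            kpis = [simulate_lookahead(state, d, policy, depth)
                    for _ in range(width)]
            scores[d] = statistics.mean(kpis)
        best = min(scores, key=scores.get)   # best aggregated KPI wins
    return best
```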

Typical examples of fixed policies for this case are:

-   Deterministic Policy: Choose the machine with the shortest queue (break ties by choosing the machine on the Right).
-   Stochastic Policy: The probability of choosing a machine is inversely proportional to its queue length (this and the preceding policy are sketched below).
-   Deterministic local linear: At D1, choose left if the expression “xQ(A)+yQ(B)+z” is greater than 50, where Q(M) is the length of the queue for a machine M and x, y, z are optimized using an offline procedure like OptQuest, a commercial simulation optimization package.
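
A sketch of the first two policies (ours, not the patent's code; Q is a hypothetical mapping from machine to current queue length):

```python
import random

# Sketches of the first two fixed policies above.

def deterministic_policy(Q, machines=("left", "right")):
    """Shortest queue, breaking ties in favor of the machine on the Right."""
    left, right = machines
    return right if Q[right] <= Q[left] else left

def stochastic_policy(Q, machines=("left", "right")):
    """Choose a machine with probability inversely proportional to its
    queue length (offset by 1 to avoid division by zero)."""
    weights = [1.0 / (Q[m] + 1) for m in machines]
    return random.choices(list(machines), weights)[0]

Q = {"left": 4, "right": 1}
print(deterministic_policy(Q))   # -> "right"
print(stochastic_policy(Q))      # -> "right" about 5 times out of 7
```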

We now describe SRDM's method to learn the stochastic policy from actual decision-making experience using the Maximum Likelihood (ML) principle as the learning framework. SRDM learns the parameters associated with the function p(a|s). For example, we may want to learn the probability of a resource committing to a particular task, given the state information in each buffer and what other tasks are currently executing. According to ML, we choose those parameters θ that maximize the likelihood of the data. In our case, the data is a set of past situations (states) and the actions of each resource for those states. When the data is iid (independently and identically distributed), this simplifies the ML task to choosing θ such that

$$\prod_{j = 1}^{r} p(a_j \mid s_j; \theta)$$

is maximized (j is the data item number, r is the number of data items).

In particular, the function p(a|s) is represented as a Bayesian Network, and the ML task simplifies to updating the parameters of the BN's conditional probability tables. In essence, the updates to the parameters are updates to frequencies of certain events in the data.

For example, the BN whose schema is shown in FIG. 9 could represent the stochastic policy. Here, the $a_i$'s represent individual resources and the $s_i$'s represent attributes of the state (e.g. number of objects in each buffer and what tasks are currently running). The assumption here is that all resources are probabilistically independent:

$$p(a_1, a_2, \ldots, a_n \mid s_1, s_2, \ldots, s_m) = \prod_{i = 1}^{n} p(a_i \mid s_1, s_2, \ldots, s_m)$$

According to the structure defined in the BN above:

$$p(a \mid s_1, s_2, \ldots, s_m) = \frac{p(s_1, s_2, \ldots, s_m \mid a)\, p(a)}{\sum_{a' \in A} p(s_1, s_2, \ldots, s_m \mid a')\, p(a')}$$

Also according to the BN above, all of the state attributes are independent:

$$p(s_1, s_2, \ldots, s_m \mid a) = \prod_{i = 1}^{m} p(s_i \mid a)$$
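
Combining the last two equations gives a naive-Bayes computation of p(a|s): multiply the prior p(a) by the per-attribute conditionals p(s_i|a) and normalize over actions. The sketch below (with invented probability tables and attribute values) shows the arithmetic:

```python
# Sketch of computing p(a|s1,...,sm) from the learned pieces: priors
# p(a) and per-attribute conditionals p(s_i|a), combined by the
# independence assumptions above and normalized over all actions.
# The probability tables are invented for illustration.

prior = {"commit_task1": 0.6, "commit_task2": 0.4}           # p(a)
cond = {                                                     # p(s_i | a)
    "commit_task1": {"buffer_full": 0.2, "machine_busy": 0.7},
    "commit_task2": {"buffer_full": 0.9, "machine_busy": 0.3},
}

def policy(observed_attrs):
    """Return p(a|s) for the observed state attribute values."""
    joint = {a: prior[a] for a in prior}
    for a in joint:
        for attr in observed_attrs:
            joint[a] *= cond[a][attr]      # p(s1,...,sm|a) = prod p(s_i|a)
    z = sum(joint.values())                # normalizing denominator
    return {a: v / z for a, v in joint.items()}

print(policy(["buffer_full", "machine_busy"]))
# -> commit_task1: 0.6*0.2*0.7 = 0.084; commit_task2: 0.4*0.9*0.3 = 0.108
#    normalized: ~0.4375 and ~0.5625
```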

Thus, the ML task simplifies to recording the probability of each state attribute value given each resource action (i.e. which task each resource commits to); the data comes from actual decisions. For discrete-valued state attributes, this amounts to storing the frequency of the attribute value given the resource action. For continuous-valued state attributes, we use a normal distribution for which we compute the sample mean and variance, given each resource action. For continuous-valued conditional attributes (e.g. one continuous-valued state attribute conditional on another), we use a conditional multivariate normal distribution. In this distribution, a child node (indexed by 1) is normally distributed with mean $\mu_1 + V_{12} V_{22}^{-1} (x_2 - \mu_2)$ and variance $V_{11} - V_{12} V_{22}^{-1} V_{21}$, where μ and V are partitioned conformably as (all the parents are indexed by 2):

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad V = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}$$

The sample mean vector and covariance matrix are easily computed from the data.
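
As a worked example of the conditional multivariate normal above (our numbers, using numpy), conditioning a single child on a single observed parent:

```python
import numpy as np

# Sketch of the conditional multivariate normal above: given the
# partitioned mean vector and covariance matrix, condition the child
# (index 1) on observed parent values x2 (index 2). Numbers are invented.

def condition(mu1, mu2, V11, V12, V21, V22, x2):
    """Mean and variance of the child given parent values x2."""
    V22_inv = np.linalg.inv(V22)
    mean = mu1 + V12 @ V22_inv @ (x2 - mu2)   # mu1 + V12 V22^-1 (x2 - mu2)
    var = V11 - V12 @ V22_inv @ V21           # V11 - V12 V22^-1 V21
    return mean, var

mu = np.array([1.0, 2.0])             # [mu1; mu2]
V = np.array([[2.0, 0.8],
              [0.8, 1.0]])            # [[V11, V12], [V21, V22]]
mean, var = condition(mu[:1], mu[1:], V[:1, :1], V[:1, 1:],
                      V[1:, :1], V[1:, 1:], x2=np.array([3.0]))
print(mean, var)   # -> [1.8] and [[1.36]]
```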

Of course, the stochastic policy need not be represented by a Bayesian Network. Other methods such as neural networks, polynomial functions, or decision trees could be used.

Whatever method is used, the learning approach has several advantages. First, it reduces the cost of model building, as the same approach can be used to learn the transition function p(t|a,s) from data. Second, it reduces the cost of model validation, since the ML principle is probabilistically sound and thus self-validating. Third, the lookahead model itself can act as a “smart” simulator, one that takes into account decisions by each agent, obviating the development of a separate simulation model for testing. Decision-making can be rigorously evaluated without disrupting the actual application. Fourth, the independence assumption makes sampling easy to compute: we need only compute p(s|a) for the current state and for each possible resource action; from this we can compute p(a|s) for each possible action and sample an action according to p(a|s). The steps from p(s|a) to p(a|s) are defined in the above equations.

Finally, all learning can be done prior to deployment, since the lookahead simulator can generate its own training data. As FIG. 10 shows, the decision-maker (a resource) generates a decision and the model simulates that decision and all the other decisions of other decision-makers up to the next decision point. The effect of those decisions is then input into the learning system, which in turn generates new parameters for the decision-maker.

As our decision-making engine is domain-independent and highly modular, it has the potential to be applied to other complex decision-making tasks such as network routing, medical decision-making, transportation routing, political decision-making, investment planning and military planning. It can also be applied to multiple agents, where parallel action sets a are assumed to be input. Finally, it can be applied in a real-time setting: rather than searching to terminals, it is possible to search to a fixed depth and return a heuristic estimate of the remaining utility instead. The depth begins at 0 and, as long as there is decision-making time, the depth is incremented by 1. The action of choice is the best (i.e. lowest utility) action associated with the last completed depth.

Integration with a Simulation System

In another embodiment, this invention enhances an existing simulation system. Although we illustrate this by enhancing a specific simulation engine, SLX, a similar approach will work for other simulation systems. SLX is general purpose software for developing discrete event simulations. SLX provides an essential set of simulation mechanisms like event scheduling, generalized wait-until, random variable generation, and statistics collection. SLX also provides basic building blocks, such as queues and facilities, for modeling a wide variety of systems at varying levels of detail. SLX provides no built-in functionality for real-time synchronization or decision-making. Our enhancement, Oasys for SLX, enhances SLX in the following ways:

The user defines specific performance measures for optimization. Performance measures include operational measures like throughput and cycle time as well as financial measures like profit and market share.

The user does not have to assign a fixed policy for each decision point. Instead, Oasys automatically chooses the decision such that the performance measure is optimized. Oasys relaxes the optimization requirement by considering other constraints such as time. For example: give me the best answer possible within 20 seconds.

-   Oasys enhances process specifications by allowing non-deterministic actions; these contentions are also resolved during simulation such that the performance measure is optimized.
-   Oasys continually learns so as to keep improving the optimization over time.
-   Oasys communicates and synchronizes with external real-time monitoring and control systems.
-   Oasys communicates and synchronizes with users in real time.

As shown in FIG. 11, Oasys interacts with the following external systems:

-   Controllers: Systems for controlling the execution systems. These systems actually effect the chosen decision in the real world.
-   Monitors: Systems for providing feedback from the execution systems. These systems sense the actual change in the real world. Sensors are needed because an external event may have taken place in the world or the actions might not have had their intended effect.

In addition, it provides a front-end to the users for providing real-time instructions.

Oasys consists of three simulation models:

-   User model: for simulating users' actions.
-   Execution model: for simulating the actions of execution systems.
-   Control model: for simulating the real-time decisions.

The control model interacts with both the user and the execution model.

At any decision point in the control model, Oasys uses SLX to perform a look-ahead with a finite number of alternatives. Each look-ahead involves running one or more simulations for some period of time, determining the performance values at the end of each of those simulations, and combining those values to obtain one set of performance values. Oasys then chooses the best alternative, which may get communicated to the user. The user may choose to accept or override this recommendation; the final decision may then get communicated to the external control systems.

Thus, Oasys alternates between the following two modes, as shown in FIG. 12:

-   Execution mode: Oasys interacts with external execution systems and users.
-   Look-ahead mode: Instead of interacting with external execution systems and users, the control model interacts with the execution and user models, respectively.

At each decision point, after the look-ahead, Oasys presents its recommendation to the user along with the expected performance values of all the alternatives. The user must take one of the following actions:

-   Select the recommendation: Oasys complies with that decision and continues until the next set of recommendations.
-   Select an alternative: Oasys complies with that decision and continues until the next set of recommendations.
-   Forgo selection: Oasys will wait for a specified amount of time, and then select its recommended decision by default.

Each decision selected by the user is immediately communicated to the execution system. Each observation made by the execution system is immediately communicated to Oasys, which relays it to SLX.

The decision points are defined by portions of the simulation where multiple outcomes are under our control. The execution-relevant state is explicitly saved before lookahead and restored whenever needed.

The nodes of the look-ahead tree represent portions of the simulation that are deterministic. The children of a node represent the results of alternative events. Each node is processed (possibly several times) in one of the following ways:

-   Terminal node: The performance forecast for the node and the performance heuristic are combined to produce the performance forecast for the decision.
-   A new child is generated: If this is the first node, a new child is generated according to the next decision we wish to try; otherwise, a child is generated according to a probability distribution and processing passes to the child.

Thus, the following could be specified by the designer:

-   What are the terminal nodes?
-   How are children generated? E.g. probabilistic sampling.
-   How are performance measures combined? E.g. average, weighted by probabilities. This requires more details on the consequences of the event of going from parent to child.

The following are learned, indexed by the relevant parts of the state:

-   User model: To anticipate the effects of users' actions.
-   Execution system model: To anticipate the effects of observations made by the execution system.
-   Performance forecast: To predict the performance measure when a new node is encountered (before any look-ahead for that node).

Other Exemplary Embodiments

Instead of choosing the action with the lowest average outcome given any state s, an alternative embodiment of our invention uses the following approach, where MIN-D, INC-D and MAX-D are depth parameters (positive numbers), MIN-C and INC-C are confidence parameters (numbers between 0 and 100), and MIN-N and INC-N (positive numbers) are iteration limits:

1. Set the lookahead depth D to MIN-D, the confidence level C to MIN-C, and the number of samples N to MIN-N.

2. Set a to be the result of LookAhead(s, D, C, N).

3. Present a to the decision maker and get the next command.

4. If the next command is “increase confidence”, increment C by INC-C and go to step 2.

5. If the next command is “increase depth”, increment D by INC-D and go to step 2.

6. If the next command is “increase iterations”, increment N by INC-N and go to step 2.

7. If the next command is “commit action”, stop.

The function LookAhead(s, D, C, N) is calculated by the following steps (sketched in code after step 5), where U1(a), U2(a), . . . , Un(a) are the sampled utilities for each action a, with sample mean S(a) and standard deviation D(a), and where M(a) is the actual mean being estimated by the sample mean S(a):

1. Set n=1

2. For each alternative action a in state s, perform lookahead to get utility Un(a).

3. Partition the actions into two categories, a non-empty Indifference set I and a Reject set R, such that:

a. For any two actions a in I and b in R, the confidence that M(a)>M(b) is more than C.

b. For any two actions a and b in I, the confidence level that M(a)>M(b) is at most C.

4. If R is empty and n<N, increment n by 1 and go to step 2

5. Return any action from the set I
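
One way to realize these steps (a sketch under our assumptions: the confidence that M(a)>M(b) is approximated with a normal comparison of the sample means, a statistic the patent does not fix, and `sample_utility(s, a, D)` stands for one depth-D lookahead):

```python
import math
import random
import statistics

def confidence_gt(a_stats, b_stats, n):
    """Approximate confidence (%) that action a's true mean exceeds b's,
    via a normal (z-style) comparison of the two sample means."""
    (ma, va), (mb, vb) = a_stats, b_stats
    se = math.sqrt((va + vb) / n) or 1e-12
    z = (ma - mb) / se
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def look_ahead(s, actions, sample_utility, D, C, N):
    """Return an action from the Indifference set I, sampling more
    lookaheads (up to N rounds) until some action can be rejected."""
    samples = {a: [] for a in actions}
    for n in range(1, N + 1):
        for a in actions:                        # one depth-D lookahead each
            samples[a].append(sample_utility(s, a, D))
        if n < 2:
            continue                             # need 2+ samples for variance
        stats = {a: (statistics.mean(u), statistics.pvariance(u))
                 for a, u in samples.items()}
        order = sorted(actions, key=lambda a: stats[a][0], reverse=True)
        # Try cuts of the mean-ordered actions: I = top k, R = the rest.
        for k in range(1, len(order)):
            I, R = order[:k], order[k:]
            if all(confidence_gt(stats[a], stats[b], n) > C
                   for a in I for b in R):
                return random.choice(I)          # R non-empty: done
    return random.choice(actions)                # no rejection was possible
```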

If the number of possible actions in a state is large or infinite, only a few of the most probable actions are considered, by sampling them using the stochastic policy p(a|s) at state s. The cutoff could be specified in several ways, for example, by setting a limit on the number of distinct actions.

There is an alternative way to define the set I, by using another confidence parameter B, no larger than C. This parameter B can also be varied based on the next command, just like the parameter C.

Another approach is to use a stochastic policy to generate the initial probabilities for the various alternatives of a decision, and then use look-ahead to refine these probabilities, until some termination criterion is satisfied.

In general, the look-ahead strategy can be specified using the following:

1. Alternative generation, prioritization, and elimination.

2. Sampling sequence

3. Sample generation

4. Sample termination

5. Terminal heuristics computation

6. Sample KPI computation

7. KPI aggregation

8. Look-ahead termination

9. Alternative Selection

Instead of using a standard execution model simulator for the lookahead, a faster abstract model may be used. This model, implemented in a general purpose programming language like C++, may be learned using observations made on the real execution system or its simulator.

If there are multiple concurrent decisions to be made, one could construct a dependency graph among them, based on whether a decision impacts another. Except for the cycles in this graph, the rest of the decisions may then be serialized (the serialization idea is sketched in code after this list). For multiple inter-dependent decisions, several approaches may be used:

1. Treat them as one complex decision (alternatives multiply)

2. Approximate them by a sequence of decisions (alternatives add up)

3. Intermediate approaches (may be based on standard optimization approaches like local search, beam search, evolutionary algorithms, etc.)
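
A sketch of this serialization (ours): find the cycles as strongly connected components, merge each cycle into one complex decision, and order the merged groups topologically. The example edges are invented; `graph[d]` lists the decisions that d impacts:

```python
from collections import defaultdict
from graphlib import TopologicalSorter

# Decisions in a dependency cycle (a strongly connected component) are
# merged into one complex decision; the merged groups are then
# serialized in topological order. graph[d] lists the decisions that d
# impacts (so d must be decided before them).

def sccs(graph):
    """Kosaraju's algorithm: map each node to its strongly connected
    component (as a frozenset)."""
    order, seen = [], set()
    def dfs(v):
        seen.add(v)
        for w in graph.get(v, ()):
            if w not in seen:
                dfs(w)
        order.append(v)
    for v in graph:
        if v not in seen:
            dfs(v)
    rev = defaultdict(list)                 # reversed edges
    for v, ws in graph.items():
        for w in ws:
            rev[w].append(v)
    comp, assigned = {}, set()
    for v in reversed(order):               # sweep in reverse finish order
        if v in assigned:
            continue
        stack, group = [v], set()
        while stack:
            x = stack.pop()
            if x not in assigned:
                assigned.add(x)
                group.add(x)
                stack.extend(rev[x])
        fg = frozenset(group)
        for x in fg:
            comp[x] = fg
    return comp

def serialize(graph):
    """Return decision groups ordered so dependencies are decided first."""
    comp = sccs(graph)
    dag = {c: set() for c in comp.values()}   # group -> predecessor groups
    for v, ws in graph.items():
        for w in ws:
            if comp[v] != comp[w]:
                dag[comp[w]].add(comp[v])     # v's group precedes w's
    return list(TopologicalSorter(dag).static_order())

# d2 and d3 impact each other (a cycle), so they become one complex decision.
graph = {"d1": ["d2"], "d2": ["d3"], "d3": ["d2", "d4"], "d4": []}
print([sorted(g) for g in serialize(graph)])  # -> [['d1'], ['d2', 'd3'], ['d4']]
```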

More Complex Belief Networks

Instead of using a simple belief network, alternative embodiments of our invention use more complex belief networks, including:

1. Belief networks with hierarchical variables

2. Belief networks with abstract data types

3. Higher-order belief network models with differentials

4. Belief networks with user-defined parameterized (re-usable) functions

5. POMDPs (Partially Observable MDPs), where the belief-vector distribution is represented by a belief network itself, which is updated through the application of decisions and observations

We detail these alternative embodiments below.

1. Belief Networks with Hierarchical Variables

An important method of dealing with complexity is to place things in a hierarchy. For example, the animal kingdom's taxonomy makes it easy for scientists to understand where each organism is located in that kingdom. Zip codes, which are hierarchically coded for each region, make it easier for the post office to distribute mail. The usual way of dealing with such hierarchies in a Belief Network is to use an “excess” encoding that represents non-sensical combinations as a zero probability. For example, if objects in a particular universe are Male or Female, and Females are additionally Pregnant or Non-Pregnant, then the usual way of encoding such a hierarchy is to represent one node in the Belief Network for Sex and another node for Pregnant (a Boolean). This method requires the user to specify a zero probability for Male and Pregnant.

In our embodiment, there is only one node, representing Male or Female and, if Female, then Pregnant or Non-Pregnant. The values for the variable at the node are three-fold: Male, Female/Pregnant, and Female/Non-Pregnant. Male and Female have their prior probabilities and Pregnant is conditional on Female. This reduces the memory requirements and makes it simpler to learn such information from data using standard learning techniques.
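
Numerically (our invented probabilities), the single hierarchical node stores just three values, with the Pregnant split conditioned on the Female branch:

```python
# Sketch of the single-node hierarchical encoding: three values, with
# the Pregnant split conditioned on Female. Probabilities are invented.

p_female = 0.5                 # prior on the top level of the hierarchy
p_pregnant_given_female = 0.1  # conditional on the Female branch only

hierarchical_node = {
    "Male": 1 - p_female,
    "Female/Pregnant": p_female * p_pregnant_given_female,
    "Female/Non-Pregnant": p_female * (1 - p_pregnant_given_female),
}
# The "excess" two-node encoding would instead store 2 x 2 entries for
# Sex x Pregnant, with a user-specified zero for (Male, Pregnant).
print(hierarchical_node)   # the three probabilities sum to 1
```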

This method of hierarchical variables can also be extended to clustering, where the hierarchies are given but the parameters are not known. Standard learning methods such as EM can be used to fit the hierarchies to data.

Finally, these hierarchies can be generalized to partial orders, where objects may belong to more than one parent class. The extension is relatively straightforward: a hierarchical variable can now be conditional on two or more parents, just as in standard Belief Networks.

2. Belief Networks with Abstract Data Types

Abstract data types in our embodiment include but are not limited to:

-   Stacks
-   Queues
-   Priority Queues
-   Records
-   Matrices
-   Association Lists
-   Count Lists
-   Existence Tables
-   Sets
-   Bags

Programming languages have long used such data types to make it simpler for programmers to develop their applications without having to specify the functions that operate on those types or the details of how those functions operate. The motivation for their use in Belief Networks is similar: after specifying the type of each variable, the user can specify how such a variable changes with each decision using the built-in functions and tests that operate on those types. For example, in manufacturing, the user often requires queues to describe a factory process, and the built-in type Queue makes it simple for the user to specify a queue.

3. Higher-Order Belief Networks with Differentials

Sometimes one would like to specify the effects of a decision or a policy in terms of information from previous states. The stochastic policy can be generalized to include information from previous states. For example, p(a|s,t) captures the idea that the probability of an action depends on the state s and some other previous state t, which could be a vector of previous states. Similarly, the transition function can be generalized to include information from previous states: p(w|a,s,t).

More generally, the policy and transition might depend on a differential vector of previous states rather than the states themselves. For example, the acceleration (a first-order difference) might be required for decision-making in an application for an autonomous guided vehicle.

4. Belief Networks with User-Defined Functions

To declare a function that a user can reuse, the user must specify certain information as in any modern programming language: parameters, local variables, and other functions within their scope. All of these can be referred to in the Bayes Network for the definition of a function. Once defined, functions can be re-used just like any built-in function. This is an important way for the user to extend the set of built-in functions to suit a particular application and to facilitate re-use.

The function definition involves the specification of a Belief Network, possibly with multiple nodes, but with a single output node representing the result of the function. The other nodes can be a combination of parameters, local variables, or variables global in scope to that function. Each node can reference other functions or recursively reference the current function.

5. Belief Networks with POMDPs as Embodied by Belief Network Representations of the Distribution

In a POMDP, the state is represented by a belief vector that captures the probability distribution associated with a particular state. Our additional embodiment is to represent this belief vector as a Belief Network itself. This network can be made arbitrarily complex, depending on the user's specification. For example, the x, y position of a vehicle might be a multivariate normal distribution or it might be a univariate normal distribution. Events (actions or observations) cause this Belief Network to be updated according to the underlying transition function. However, unlike in standard MDPs or standard POMDPs, observations cause a change in state just as actions do. The way this change in state takes place can include, but is not limited to: user-specified operations on the Belief Network and multiple sampling of the belief network according to the transition function. In the case of sampling, the weights of the network are adjusted for each prior combination of variables, for each child node and its parent nodes. This compact representation of a belief vector allows the solution of POMDPs of greater complexity than before. Moreover, it generalizes approaches such as Kalman Filtering.
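
One way to read the sampling-based update (a particle-style sketch under our assumptions; the motion and observation models are invented, and this is only in the spirit of the generalization of Kalman Filtering mentioned above) is to represent the belief over, say, a vehicle's x position by samples and update them on both actions and observations:

```python
import math
import random

# Particle-style sketch of a sampled belief update: the belief over a
# vehicle's x position is a set of samples; an action pushes the samples
# through an (invented) transition model, and an observation reweights
# and resamples them, so observations change the state just as actions do.

random.seed(0)
belief = [random.gauss(0.0, 1.0) for _ in range(10_000)]   # prior belief

def act(belief, velocity):
    """Transition: move each sample, with process noise."""
    return [x + velocity + random.gauss(0.0, 0.2) for x in belief]

def observe(belief, z, noise=0.5):
    """Observation: weight each sample by the likelihood of z, resample."""
    weights = [math.exp(-(z - x) ** 2 / (2 * noise ** 2)) for x in belief]
    return random.choices(belief, weights, k=len(belief))

belief = act(belief, velocity=1.0)
belief = observe(belief, z=1.2)
print(sum(belief) / len(belief))   # posterior mean near 1.16
```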

Having described the exemplary embodiments of the invention, additional advantages and modifications will readily occur to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. Therefore, the specification and examples should be considered exemplary only, with the true scope and spirit of the invention being indicated by the enclosed claims.

1. A method for optimizing an active decision making process that requires selecting actions at a sequence of choice points, comprising: a. creating a simulation model for the active decision making process comprising the potential effects of an action; b. generating a plurality of alternative actions at a choice point in the active decision making process; c. for one of these alternative actions, generating a simulation of the future decision making process using the simulation model; and d. analyzing the result of this simulation to select an action for the choice point.

2. The method of claim 1, wherein the simulation model comprises a stochastic component.

3. The method of claim 2, wherein the stochastic component comprises a policy for choosing among alternative decisions.

4. The method of claim 1, wherein two simulations are interleaved such that one simulation starts before another ends.

5. The method of claim 1, wherein the simulation model comprises a Bayesian network.

6. The method of claim 1, wherein the simulation model comprises a component selected from the group consisting of hierarchical variables, abstract data types, differential vector of previous states, user-defined functions, Markov decision processes, partially-observable Markov decision processes, heuristics evaluation function, user model for simulating users of the active decision making process, execution model for simulating an external application, and control model for simulating the active decision making process.

7. The method of claim 1, further integrating the active decision making process with an external application.

8. The method of claim 7, wherein the external application comprises a simulation system.

9. The method of claim 7, wherein the simulation model is updated using the data obtained by monitoring the external application.

10. The method of claim 1, wherein the simulation model is updated using the result of the simulation.

11. A computer implemented system for optimizing an active decision making process that requires selecting actions at a sequence of choice points, comprising: a. a simulation model for the active decision making process comprising the potential effects of an action; b. generation of a plurality of alternative actions at a choice point in the active decision making process; c. for one of these alternative actions, generation of a simulation of the future decision making process using the simulation model; and d. analysis of the result of this simulation to select an action for the choice point.

12. The system of claim 11, wherein the simulation model comprises a stochastic component.

13. The system of claim 12, wherein the stochastic component comprises a policy for choosing among alternative decisions.

14. The system of claim 11, wherein two simulations are interleaved such that one simulation starts before another ends.

15. The system of claim 11, wherein the simulation model comprises a Bayesian network.

16. The system of claim 11, wherein the simulation model comprises a component selected from the group consisting of hierarchical variables, abstract data types, differential vector of previous states, user-defined functions, Markov decision processes, partially-observable Markov decision processes, heuristics evaluation function, user model for simulating users of the active decision making process, execution model for simulating an external application, and control model for simulating the active decision making process.

17. The system of claim 11, further integrating the active decision making process with an external application.

18. The system of claim 17, wherein the external application comprises a simulation system.

19. The system of claim 17, wherein the simulation model is updated using the data obtained by monitoring the external application.

20. The system of claim 11, wherein the simulation model is updated using the result of the simulation.